Abstract. Consider the following strategy to construct a walk in a labeled
digraph: at each vertex, we follow the unvisited arc of minimum label. In
this work we study for which languages this strategy, applied to the
corresponding de Bruijn graph, finishes with an Eulerian cycle, and thus
yields the minimal de Bruijn sequence of the language.

1 Introduction 

Given a language, a de Bruijn sequence of span n is a periodic sequence such that
every n-tuple in the language (and no other n-tuple) occurs exactly once. Its first
known description appears as a Sanskrit word, yamátárájabhánasalagám, which
was a memory aid for Indian drummers, where the accented/unaccented syllables
represent long/short beats, so all possible triplets of short and long beats are
included in the word. De Bruijn sequences are also known as "shift register
sequences" and were originally studied by N. G. de Bruijn for the binary alphabet
[1]. These sequences have many different applications, such as memory wheels
in computers and other technological devices, network models, DNA algorithms,
pseudo-random number generation, and modern public-key cryptographic schemes,
to mention a few (see [2], [3], [4]). Historically, de Bruijn sequences were studied
over an arbitrary alphabet considering the language of all the n-tuples. There
is a large number of de Bruijn sequences in this case, but only a few can be
generated efficiently; see [5] for a survey on this subject. In 1978, Fredricksen
and Maiorana [6] gave an algorithm to generate a de Bruijn sequence of span n
based on the Lyndon words of the language, which turned out to be the minimal
one in the lexicographic order, and this algorithm was proved to be efficient [7].
Recently, the study of these concepts was extended to languages with forbidden
substrings: in [8] efficient algorithms were given to generate all the words in a
language with one forbidden substring, and in [9] the concept of de Bruijn
sequences was generalized to restricted languages with a finite set of forbidden
substrings, the existence of these sequences was proved, and an algorithm to
generate one of them was presented. However, finding the minimal sequence is a
non-trivial problem in this more general case. This problem is closely related to
the "shortest common superstring problem", which is an important problem in the
areas of DNA sequencing and data compression.

In this work we study the de Bruijn sequence of minimal lexicographical
label. In section 2 we present some definitions and previous results on de Bruijn
sequences and the BEST Theorem, necessary to understand the main problem,
and we prove a result related to the BEST Theorem which will be useful in the
following sections. In section 3 we study the main problem, giving some results
on the structure of the de Bruijn graph. Finally, in section 4 we present some
remarks and extensions of this work.

2 De Bruijn Sequence of Restricted Languages 
2.1 Definitions 

Let A be a finite set with a linear order <. A word w on the alphabet A is a
finite sequence of elements of A, whose length is denoted by $|w|$.

A word p is said to be a factor of a word w if there exist words $u, v \in A^*$
such that w = upv. If u is the empty word $\varepsilon$ then p is called a prefix
of w, and if v is empty then p is called a suffix of w. If $p \neq w$ then p is a
proper factor, proper prefix or proper suffix, respectively.

The set $A^*$ of all the words on the alphabet A is linearly ordered by the
alphabetic order induced by the order < on A. By definition, x < y either if x
is a prefix of y or if x = uav, y = ubw with $u, v, w \in A^*$, $a, b \in A$ and
a < b. A basic property of the alphabetic order is the following: if x < y and if
x is not a prefix of y, then for any pair of words u, v, xu < yv.
Given an alphabet A, the full shift $A^{\mathbb{Z}}$ is the collection of all
bi-infinite sequences of symbols from A. Let $\mathcal{F}$ be a finite set of words
over A. A subshift of finite type (SFT) is the subset of sequences in
$A^{\mathbb{Z}}$ which do not contain any factor in $\mathcal{F}$. We will refer to
$\mathcal{F}$ as the set of forbidden blocks or forbidden factors.

Given a set $\mathcal{F}$ of forbidden blocks, in this work we will say that a
word w is in the language if the periodic word $w^\infty$, composed of infinite
repetitions of w, is in the language of the SFT defined by $\mathcal{F}$. The set
of all the words of length n in the language defined by $\mathcal{F}$ will be
denoted by $W^{\mathcal{F}}(n)$.
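The membership test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: words are represented as Python strings, and the names `in_language` and `words` are hypothetical helpers, not notation from the paper.

```python
from itertools import product

def in_language(w, forbidden):
    """True if the periodic word w^infinity contains no forbidden factor."""
    if not w:
        return True
    for f in forbidden:
        # Every factor of length len(f) of w^infinity starts inside one
        # period, so it already appears in this many copies of w.
        copies = len(f) // len(w) + 2
        if f in w * copies:
            return False
    return True

def words(n, alphabet, forbidden):
    """W(n): the words of length n in the language, in alphabetic order."""
    candidates = ("".join(t) for t in product(sorted(alphabet), repeat=n))
    return [w for w in candidates if in_language(w, forbidden)]
```

For the Golden Mean system (forbidden set {11}) over {0, 1}, `words(3, "01", {"11"})` returns the four words 000, 001, 010 and 100; the word 101 is excluded because its periodic repetition contains 11.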

An SFT is irreducible if for every ordered pair of blocks u, v in the language
there is a block w in the language so that uwv is a block of the language.

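A de Bruijn sequence of span n in a restricted language is a circular string
$B^{\mathcal{F},n}$ of length $|W^{\mathcal{F}}(n)|$ such that all the words in the
language of length n are factors of $B^{\mathcal{F},n}$. In other words, writing
$N = |W^{\mathcal{F}}(n)|$,

$$\{(B^{\mathcal{F},n})_i \cdots (B^{\mathcal{F},n})_{i+n-1 \bmod N} : i = 0, \ldots, N-1\} = W^{\mathcal{F}}(n)$$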



These concepts are studied in [9], extending the known results on subshifts
of finite type to this context. In particular two results are relevant in this work.
The first one is a bound on the number of words of length n in the language:

$$|W^{\mathcal{F}}(n)| = \Theta(\lambda^n)$$

where $\log(\lambda)$ is the entropy of the system (see [10]). The second result
proves the existence of a de Bruijn sequence:

Theorem 1. For any set of forbidden substrings $\mathcal{F}$ defining an
irreducible subshift of finite type, there exists a de Bruijn sequence of span n.

This last theorem is a direct consequence of the fact that the de Bruijn
graph of span n is an Eulerian graph. The de Bruijn graph of span n, denoted
by $G^{\mathcal{F},n}$, is the largest strongly connected component of the directed
graph with $|A|^n$ vertices, labeled by the words in $A^n$, and the set of arcs

$$E = \{(as, sb) : a, b \in A,\ s \in A^{n-1},\ asb \in W^{\mathcal{F}}(n+1)\}$$

where the label of the arc e = (as, sb) is l(e) = b. Note that if the SFT is
irreducible, this graph has only one strongly connected component of size greater
than 1, so there is no ambiguity in the definition.
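A minimal Python sketch of this construction follows. It assumes words are strings, uses a hypothetical dictionary layout `{tail: {label: head}}`, and does not extract the largest strongly connected component: it simply keeps every vertex that carries an arc, which is only adequate when those vertices already form a single component.

```python
from itertools import product

def in_language(w, forbidden):
    # w is in the language if w^infinity avoids every forbidden block
    return all(f not in w * (len(f) // len(w) + 2) for f in forbidden)

def de_bruijn_graph(n, alphabet, forbidden):
    """Return {tail: {label: head}} for the arcs (as, sb) with asb in W(n+1)."""
    graph = {}
    for t in product(sorted(alphabet), repeat=n + 1):
        w = "".join(t)
        if in_language(w, forbidden):
            # The word w = asb gives the arc as -> sb with label b.
            graph.setdefault(w[:-1], {})[w[-1]] = w[1:]
    return graph
```

For the Golden Mean system with n = 2 this produces three vertices (00, 01, 10) and four arcs, one for each word of length 3 in the language.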

No two vertices have the same label, hence from now on we identify
a vertex with its label. If $W = e_1 \ldots e_k$ is a walk over $G^{\mathcal{F},n}$,
we denote the label of W by $l(W) = l(e_1) \ldots l(e_k)$, and by $l(W)^j$ the
concatenation of j copies of l(W).

There exists a bijection between the arcs of $G^{\mathcal{F},n}$ and the words in
$W^{\mathcal{F}}(n+1)$, because to each arc with label $a \in A$ with tail at
$w' \in A^n$ we can associate the word w'a which is, by definition, a word in
$W^{\mathcal{F}}(n+1)$. Equally, if w'a is a word of $W^{\mathcal{F}}(n+1)$, with
$a \in A$, then there exists a vertex w' and an arc with tail at this vertex with
label a.

Furthermore, if a word w is the label of a walk from u to v then v is the suffix
of length n of uw. In the same way, if $w \in W^{\mathcal{F}}(n+1)$ then there is a
cycle C in $G^{\mathcal{F},n}$ with label l(C) such that $l(C)^{(n+1)/|C|} = w$.

With all these properties it is easy to see that a de Bruijn sequence of span
n + 1 is exactly the label of an Eulerian cycle over $G^{\mathcal{F},n}$.

2.2 The BEST Theorem 

BEST is an acronym of the names of N. G. de Bruijn, T. van Aardenne-Ehrenfest,
C. A. B. Smith and W. T. Tutte. The BEST Theorem (see [11]) gives a
correspondence between Eulerian cycles in a digraph and its rooted trees
converging to the root vertex.

Let r be a vertex of an Eulerian digraph G = (V, E). A spanning tree converging
to the root r is a spanning tree such that there exists a directed path
from each vertex to the root.

Given an Eulerian cycle starting at the root of an Eulerian digraph, if for
every vertex of G we take the last arc with tail at this vertex in the cycle then
we obtain a spanning tree converging to the root. Conversely, given a spanning
tree converging to the root, a walk over G starting at the root and using the
arc in the tree only if all the other arcs with tail at this vertex have been
used, is an Eulerian cycle. A walk over the graph of this kind will be called a
walk "avoiding the tree".

Fig. 1. De Bruijn digraph of span 5 for the Golden Mean ($\mathcal{F} = \{11\}$)

The BEST Theorem proves that for every different spanning tree we have a
different Eulerian cycle. Therefore it also allows us to calculate the exact number
of Eulerian cycles on a digraph, which is given by

$$C = M_T \cdot \prod_{i=1}^{|V|} (d^+(v_i) - 1)!$$

where $M_T$ is the number of rooted spanning trees converging to a given vertex.
We bound the second factor from below by $((\bar{d}^+ - 1)!)^{|V|}$, where
$\bar{d}^+$ is the mean of the outgoing degrees over all the vertices, so we have a
lower bound on the number of de Bruijn sequences

$$C \geq (\lfloor \lambda - 1 \rfloor !)^{|W^{\mathcal{F}}(n-1)|}$$

In particular, for a system with $\lambda \geq 3$ the number of de Bruijn sequences
of span n is exponential in the number of words in the language of length n - 1.
In the systems with $3 > \lambda > 1$ this bound is generally also true, because the
underestimated term $M_T$ is generally exponential; for example, in the system
without restrictions over the alphabet {0, 1}, this term is equal to
$2^{2^{n-1}-n}$.
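As a small sanity check of the counting formula, consider the unrestricted binary alphabet and de Bruijn sequences of span 3. This worked instance is added for illustration: the graph has the four vertices 00, 01, 10, 11, each with outgoing degree 2, and the value $M_T = 2$ used below is obtained by directly enumerating the spanning trees converging to the vertex 11.

```latex
\[
  C \;=\; M_T \cdot \prod_{i=1}^{4} \bigl(d^+(v_i) - 1\bigr)!
    \;=\; 2 \cdot (1!)^4
    \;=\; 2
\]
```

This agrees with the two binary de Bruijn sequences of span 3, namely 00010111 and 00011101.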



Now we formally define a walk "avoiding a subgraph". Let r be any vertex.
For each vertex $v \neq r$ in $G^{\mathcal{F},n}$ let $e_v$ be any arc starting at
v. Let H be the spanning subgraph of $G^{\mathcal{F},n}$ with arc set
$\{e_v : v \in V(G^{\mathcal{F},n}) \setminus \{r\}\}$.

It is easy to see that H is composed of cycles, subtrees converging to a cycle,
and one subtree converging to r. For a vertex v not in a cycle of H, we define
$H_v$ as the directed subtree converging to v in H.

We define recursively a walk in $G^{\mathcal{F},n}$ which avoids H. It starts at
the root vertex r. Let $v_0 e_0 \cdots e_{i-1} v_i$ be the current walk. If there is
an unvisited arc $e_i = (v_i, v_{i+1})$ not in H we extend the walk by
$e_i v_{i+1}$. Otherwise we use the arc $e_{v_i}$ in H.
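The recursive definition can be sketched as a simple loop. The sketch below is an illustration only: the graph layout `{tail: {label: head}}` and the dictionary `reserved` (mapping each vertex v other than the root to the label of its arc in H) are assumptions of this example, and choosing the smallest free label is just one way to pick "an unvisited arc not in H".

```python
def walk_avoiding(graph, root, reserved):
    """Walk from root that uses the reserved arc of a vertex only as a last resort.

    graph:    {tail: {label: head}}
    reserved: {vertex: label of its arc in H}, given for every vertex != root
    Returns the list of arc labels along the walk.
    """
    unused = {v: set(arcs) for v, arcs in graph.items()}
    labels, v = [], root
    while unused[v]:
        free = unused[v] - {reserved.get(v)}
        # Prefer an unvisited arc outside H (here: the smallest label);
        # fall back to the arc of H only when nothing else is left.
        label = min(free) if free else reserved[v]
        unused[v].remove(label)
        labels.append(label)
        v = graph[v][label]
    return labels
```

The walk stops when it reaches a vertex with no unvisited outgoing arc; in an Eulerian digraph this can only happen at the root.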

We say that a walk over the graph exhausts a vertex if the walk uses all the
arcs having the vertex as head or tail.

The next lemma studies in which order the vertices are exhausted in a walk
avoiding H.

Lemma 2. Let W be a walk starting at vertex r avoiding H, let v be a vertex
and let $W_v$ be the subwalk of W starting at vertex r and finishing when it
exhausts the vertex v. Then for each vertex u in $H_v$, u is exhausted in $W_v$.

Proof. By induction on the depth of the subtree with root v. If v is a leaf of
H then $H_v = \{v\}$. If v is not a leaf and $W_v$ exhausts v, then $W_v$ visits
all arcs $(v, w) \in E$, and therefore all the arcs $(u, v) \in E$. Applying the
induction hypothesis to all vertices u such that $(u, v) \in E$ we prove the
result. □



3 Minimal de Bruijn Sequence 

Let $m = m_1 \ldots m_n$ be the vertex of $G^{\mathcal{F},n}$ of maximum label in
the lexicographic order. We are interested in obtaining the Eulerian cycle of
minimum label starting at m. In order to obtain this cycle, we define the
following walk: starting at m, at each vertex we continue by the arc with the
lowest label among the unvisited arcs with tail at this vertex. A walk constructed
in this way will be called a minimal walk. By definition, there is no walk with a
lexicographically lower label, except its subwalks. In this section we
characterize when a minimal walk starting at m is an Eulerian cycle, obtaining the
minimal de Bruijn sequence.
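The minimal walk is easy to express in code. The following Python sketch assumes the graph is given as a dictionary `{tail: {label: head}}` (a representation chosen for this example) and is illustrated on the unrestricted binary graph with n = 2.

```python
def minimal_walk(graph, start):
    """Follow, at each vertex, the unvisited outgoing arc of minimum label."""
    # Labels are kept in decreasing order so that pop() returns the smallest.
    unused = {v: sorted(arcs, reverse=True) for v, arcs in graph.items()}
    labels, v = [], start
    while unused[v]:
        label = unused[v].pop()      # smallest label still unvisited at v
        labels.append(label)
        v = graph[v][label]
    return "".join(labels)

# Unrestricted binary de Bruijn graph with vertices of length n = 2.
graph = {a + b: {c: b + c for c in "01"} for a in "01" for b in "01"}
m = max(graph)                       # vertex of maximum label, here "11"
print(minimal_walk(graph, m))        # 00010111
```

Here the walk uses all eight arcs, so its label 00010111 is a de Bruijn sequence of span 3 over {0, 1}.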

For each vertex v let e(v) be the arc with tail at the vertex v and with
maximum label. Let T be the spanning subgraph of $G^{\mathcal{F},n}$ composed of
the set of arcs e(v), for $v \in V(G^{\mathcal{F},n})$, $v \neq m$. The label of
e(v) will be denoted by $\gamma(v)$.

It is easy to see that a minimal walk is a walk avoiding T, hence we can study
a minimal walk by analyzing the structure of T.

Theorem 3. A minimal walk is an Eulerian cycle if and only if T is a tree. 

Proof. A minimal walk W exhausts m. If T is a tree then by Lemma 2 all vertices
of T are exhausted by W, hence W is an Eulerian cycle. Conversely, if W is an
Eulerian cycle, by the BEST Theorem the subgraph composed of the last arc
visited at each vertex is a tree, but this subgraph is T, so T is a
tree. □
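The condition of Theorem 3 can be tested directly by following the arcs of T. The sketch below is illustrative only; it assumes the graph is stored as `{tail: {label: head}}`, and the function name is hypothetical.

```python
def max_arc_subgraph_is_tree(graph, m):
    """Check whether T (the maximum-label arc of every vertex v != m) is a
    tree converging to m, i.e. whether T-arcs lead from every vertex to m."""
    succ = {v: arcs[max(arcs)] for v, arcs in graph.items() if v != m}
    for v in succ:
        seen = set()
        while v != m:
            if v in seen:            # returned to a visited vertex: cycle in T
                return False
            seen.add(v)
            v = succ[v]
    return True
```

For the unrestricted binary graph with n = 2 and m = 11, the arcs of T are 00 -> 01, 01 -> 11 and 10 -> 01, so the function returns True, in agreement with the minimal walk being an Eulerian cycle in that case.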



In the unrestricted case (when $W^{\mathcal{F}}(n) = A^n$), the subgraph T is a
regular tree of depth n where each non-leaf vertex has $|A|$ children, therefore
the minimal walk is an Eulerian cycle.

In the restricted case, we do not obtain necessarily an Eulerian cycle, because 
T is not necessarily a spanning tree converging to the root due to the existence 
of cycles. 

We will study the structure of the graph $G^{\mathcal{F},n}$ and the subgraph T,
especially the cycles in T. The main theorem of this section characterizes the
label of cycles in T, allowing us to characterize the languages where the minimal
walk is an Eulerian cycle.

First of all, we will prove some properties of the de Bruijn graph to
understand the structure of the arcs and cycles in T.

Lemma 4. Let $k \geq n + 2$. Let $W = v_0 e_0 v_1 e_1 \cdots e_{k-1} v_k$ be a
walk in T. Then $l(e_0) \leq l(e_{n+1})$.

Proof. Since $v_n = l(e_0) \cdots l(e_{n-1})$ we have that
$l(e_1) \cdots l(e_{n-1}) l(e_n) l(e_0) \in W^{\mathcal{F}}(n+1)$. Hence there
exists an arc $(v_{n+1}, u)$ with label $l(e_0)$, where
$v_{n+1} = l(e_1) \cdots l(e_{n-1}) l(e_n)$. By the definition of T,
$l(e_0) \leq \gamma(v_{n+1}) = l(e_{n+1})$. □

Corollary 5. Let C be a cycle in T. Then $|C|$ divides n + 1. Moreover for every
vertex u in C, $u\gamma(u) = l(C)^{(n+1)/|C|}$.

Proof. Let us consider the walk
$W = v_0 e_0 v_1 \cdots e_{(n+1)|C|-1} v_0 e_0 v_1$, which traverses the cycle C
n + 1 times and then its first arc once more (arc indices are taken modulo
$|C|$). From Lemma 4 we have
$l(e_0) \leq l(e_{n+1}) \leq l(e_{2(n+1)}) \leq \cdots \leq l(e_{(n+1)|C|}) = l(e_0)$.
Since we can start the cycle at any vertex we conclude that
$l(e_i) = l(e_{(n+1)+i})$ for every $i = 0, \ldots, |C| - 1$. Hence $|C|$ divides
n + 1. The second conclusion comes from the fact that the label of any walk of
length at most n ending in a vertex u is a suffix of u. □

Let $u \neq m$ be a vertex. Among all the words which are a prefix of m and a
suffix of u, let g(u) be the longest one (notice that g(u) could be the empty word
$\varepsilon$ and $|g(u)| < n$). Let $\alpha(u) = m_{|g(u)|+1}$ be the letter
following the end of g(u) in m.

Notice that in the unrestricted case, $n - |g(u)|$ is the distance over the graph
from the vertex u to m. This function will be essential in the study of T. The
next lemma gives us a bound on the labels of the arcs in terms of the function
$g(\cdot)$.
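Both functions are straightforward to compute. The following Python sketch assumes vertices are strings of equal length with u different from m; the names `g` and `alpha` mirror the notation of the text, with `alpha` standing for the letter that follows g(u) in m.

```python
def g(u, m):
    """Longest word that is both a prefix of m and a suffix of u (u != m)."""
    for k in range(len(u) - 1, 0, -1):   # try the longest candidates first
        if u.endswith(m[:k]):
            return m[:k]
    return ""                            # only the empty word matches

def alpha(u, m):
    """Letter of m that follows the prefix g(u)."""
    return m[len(g(u, m))]
```

For example, with m = 1010 and u = 0101 the longest common word is 101, so `g("0101", "1010")` is `"101"` and `alpha("0101", "1010")` is `"0"`.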

Lemma 6. For all pairs of adjacent vertices u and v, $l(uv) \leq \alpha(u)$.
Moreover, if $l(uv) < \alpha(u)$ then $g(v) = \varepsilon$ and if
$l(uv) = \alpha(u)$ then $g(v) = g(u)l(uv)$.

Proof. g(u) is a suffix of u, and $u\,l(uv) \in W^{\mathcal{F}}(n+1)$, so
$g(u)l(uv)$ is a prefix of a word in $W^{\mathcal{F}}(n+1)$. Since m is the maximal
word and g(u) is a prefix of m we get $l(uv) \leq \alpha(u)$.

If $l(uv) = \alpha(u)$ then $g(u)l(uv)$ is a prefix of m and a suffix of v. Hence
$g(u)l(uv)$ is a suffix of g(v). Since by removing the last letter of a suffix of
v we obtain a suffix of u we conclude $g(v) = g(u)l(uv)$.

We show that if $g(v) \neq \varepsilon$ then $\alpha(u) \leq l(uv)$. Let
$g(v) = g'(v)l(uv)$; then g'(v) is a suffix of u and a prefix of m. Hence g'(v) is
a suffix of g(u). Therefore $g'(v)\alpha(u)$ is a factor of m. By the definition
of g(v) and the maximality of m, g(v) is greater than or equal to
(lexicographically) $g'(v)\alpha(u)$. We conclude that $\alpha(u) \leq l(uv)$, so
$l(uv) < \alpha(u)$ forces $g(v) = \varepsilon$. □

In the unrestricted case, where T is a tree of depth n, all the arcs not in T
go to a leaf. In the general case we can define an analog of the leaves.

We say that a vertex u is a floor vertex if $g(u) = \varepsilon$. Notice that in
the unrestricted case the leaves of T are the floor vertices. We say that a vertex
u is a restricted vertex if $\gamma(u) < \alpha(u)$.

Corollary 7. If a cycle in T contains $\ell$ restricted vertices, then it has
exactly $\ell$ floor vertices.

Proof. From Lemma 6 we know that if a vertex u is restricted then for every arc
(u, v) the vertex v is a floor vertex. To conclude it is enough to see that in T an
arc (u, v) with u unrestricted has label $\alpha(u)$. Then v is not a floor
vertex. □

Corollary 8. Let P be a path in T starting in a floor vertex, ending in a vertex
v and with unrestricted inner vertices. Then $l(P) = g(v)$.

Proof. We apply induction on the length of P. The case where the length of P
is zero is direct since v is a floor vertex. Let us consider the case where P has
length at least 1. Since v is not a restricted vertex, from Lemma 6 we know that
$g(v) = g(u)l(uv)$, where u is its neighbor in P. By the induction assumption
$g(u) = l(P')$ where P' is the path obtained from P by removing the arc (u, v).
Hence $g(v) = l(P')l(uv) = l(P)$. □

We will use these results to characterize the label of cycles in T; in
particular, we will characterize the restricted vertices of a cycle.

Theorem 9. Let C be a cycle in T, let $u^0, \ldots, u^{k-1}$ be the restricted
vertices in C ordered according to the order of C. Then
$u^i = g(u^{i+1})\gamma(u^{i+1}) \cdots \gamma(u^{i-1})g(u^i)$
for $i = 0, \ldots, k - 1$, where $i + 1, \ldots, i - 1$ are computed mod k.

Proof. From Corollary 8 the label of C is
$g(u^0)\gamma(u^0) \cdots g(u^{k-1})\gamma(u^{k-1})$, and by definition of
$G^{\mathcal{F},n}$, $u^i$ is the label of any walk over $G^{\mathcal{F},n}$ of
length n finishing in $u^i$. By Corollary 5 we can take the walk $C^r$ composed
of $r = (n+1)/|C|$ repetitions of C finishing in $u^i$, concluding that
$u^i = g(u^{i+1})\gamma(u^{i+1}) \cdots \gamma(u^{k-1})\, l(C)^{r-1}\, g(u^0)\gamma(u^0) \cdots \gamma(u^{i-1})g(u^i)$. □

Now we are able to give a characterization of the languages where a minimal 
walk produces an Eulerian cycle. 

Let $\mathcal{H}$ be the subset of $W^{\mathcal{F}}(n+1)$ where
$w \in \mathcal{H}$ if and only if w can be decomposed as
$w = h^0\beta_0 \ldots h^{k-1}\beta_{k-1}$ where each $h^i \in A^*$ and
$\beta_i \in A$ satisfy the following conditions:

1. $h^i = m_1 \ldots m_{|h^i|}$ (a prefix of m)

2. $\beta_i < m_{|h^i|+1}$

3. $\forall \beta' > \beta_i$, $h^{i+1}\beta_{i+1} \ldots \beta_{i-1}h^i\beta' \notin W^{\mathcal{F}}(n+1)$

Fig. 2. Example of the subgraph T for n = 4 and $\mathcal{F} = \{01111\}$ in a binary alphabet.

With this definition, we can characterize the languages where a minimal walk is
an Eulerian cycle.

Theorem 10. A minimal walk is an Eulerian cycle if and only if $\mathcal{H} = \emptyset$.

Proof. From Theorem 3, we have to prove that T is a tree if and only if
$\mathcal{H} = \emptyset$.

If T is not a tree then T has a cycle C. Let $u^0, \ldots, u^{k-1}$ be the
restricted vertices of the cycle. By Theorem 9,
$l(C) = g(u^0)\gamma(u^0) \ldots g(u^{k-1})\gamma(u^{k-1})$ and by Corollary 5
$|C|$ divides n + 1. Therefore there exists a word w in $W^{\mathcal{F}}(n+1)$
composed of $(n+1)/|C|$ repetitions of l(C). By definition of $\mathcal{H}$ we
conclude that $w \in \mathcal{H}$.

Conversely, let us assume that T has no cycles and $\mathcal{H} \neq \emptyset$.
Let w be a word in $\mathcal{H}$. By definition of $G^{\mathcal{F},n}$, there is a
cycle C in $G^{\mathcal{F},n}$ of length dividing n + 1 such that C (or
repetitions of C) has label w. We shall prove that C is also a cycle in T.

Let v be a vertex of C, with $v = \ldots \beta_{i-1}(h^i)_1 \ldots (h^i)_j$ where
$j = 0, \ldots, |h^i|$. If $0 < j < |h^i|$, then $m_1 \ldots m_j$ is a suffix of v,
so $\alpha(v) = m_{j+1} = (h^i)_{j+1}$, hence the arc of C with tail at v is in T.
If j = 0 then $\gamma(v) = m_1$, therefore the arc in C is in T. Finally, let us
consider the case $j = |h^i|$. If (v, v') is the arc in C then
$l(vv') = \beta_i$. Since $w \in \mathcal{H}$, no arc in $G^{\mathcal{F},n}$ with
tail at v has a label greater than $\beta_i$. Then $(v, v') \in T$. We conclude
that C is a cycle in T, which leads to a contradiction. □



4 Some Remarks 



The previous analysis considers only the minimal walk starting at the root
vertex. This case does not necessarily produce the minimal label over all Eulerian
cycles, because there can be Eulerian cycles starting at a non-root vertex with
a lexicographically lower label.

It is also possible to construct an algorithm which modifies T in order to
destroy cycles in T, and obtain the minimal de Bruijn sequence for any irreducible
subshift of finite type. Moreover, further research on this subject allowed us to
construct an algorithm to obtain the minimal Eulerian cycle for any edge-labeled
digraph (see [12]), but this result is beyond the scope of this work.